Skip to content

convert-sha256: one-off SHA1 → SHA256 repo conversion#66

Open
nodo wants to merge 19 commits into
mainfrom
nodo/convert-sha256
Open

convert-sha256: one-off SHA1 → SHA256 repo conversion#66
nodo wants to merge 19 commits into
mainfrom
nodo/convert-sha256

Conversation

@nodo
Copy link
Copy Markdown
Collaborator

@nodo nodo commented May 25, 2026

Summary

New git-sync convert-sha256 subcommand. One-off conversion: fetches a pack over smart HTTP from a SHA1 source, walks every reachable object, and writes a fresh SHA256 bare repository where every tree/commit/tag reference is re-encoded under SHA256. All branches and tags are always converted (partial scope risks stranding cross-branch references in messages). Behavior, sharp edges, and operational characteristics documented in docs/convert-sha256.md.

Design: two-pass topological DFS

The translation needs an authoritative "what's in scope" view before encoding starts, so abbreviated SHA1 prefixes in commit messages get the same uniqueness verdict regardless of how far translation has progressed, and so cross-branch message references (cherry-picks, reverts, etc.) can be added as dependency edges in the DFS:

  1. Discovery: a non-encoding DFS from each desired ref tip walks every reachable object via tree entries, commit tree/parent links, and tag targets. Produces a reachable map[SHA1]ObjectType covering the full in-scope set. Submodule gitlinks pointing outside the repo error here, before the target bare repo is materialized — failing fast keeps half-converted state off disk.
  2. Translation: memoized recursive DFS in topological order. Tree, parent, tag-target, and message-reference edges are all part of the DFS, so the SHA1 → SHA256 mapping contains every dependency by the time any object's bytes are encoded. Single forward sweep — no cascade re-encoding, no per-commit dirty bookkeeping.

Cycles in the combined dependency graph are cryptographically infeasible (forming a cycle via message references would require knowing each hash before computing it), so the topological order is always valid. A defensive inProgress guard turns unexpected graph shapes into a clear error rather than a stack overflow.

Loose objects are written by hand — go-git@v6.0.0-alpha.3's objfile.Writer hardcodes SHA1 in its hasher, so the standard SetEncodedObject path would place every translated object at a SHA1-derived filename even on a SHA256 store. A unit test recomputes sha256 of every loose object's decompressed content and compares against the filename to prevent regression.

Signing (--sign)

After conversion, --sign shells out to git tag -s converted/<branch> <tip> for every converted branch. Each resulting signed annotated tag transitively attests the entire reachable history of that branch via the parent chain: tampering with any historical commit changes ancestor hashes, which changes the tip's hash, which changes the tag's object line, which invalidates the signature.

The signature is by the converter, not by the original authors — the originals' signatures are necessarily lost (they signed over pre-rewrite bytes). For internal mirrors and single-identity repos that's a strict improvement over unsigned-everywhere; for broad public repos it's weaker than the pre-conversion chain. SSH signing (gpg.format = ssh) and OpenPGP both work — we delegate to git, so whatever the user's git config picks is honored. --sign-key <id> overrides the default key.

Example output

--sign --check against entireio/cli (283 refs, 6,950 commits, 55,289 objects):

$ OUT=$(mktemp -d) && ./git-sync convert-sha256 --sign --check https://github.com/entireio/cli.git "$OUT"
fetching 283 ref(s) from https://github.com/entireio/cli.git ...
discovering reachable objects ...
translating objects to sha256 ...
signing refs/tags/converted/1072-intermittent-stop-hook-timeout-... ...
signing refs/tags/converted/20260116-git-hook-absolute-path ...
signing refs/tags/converted/20260119-do-not-pin-go ...
... (one line per branch tip)
verifying output ...
  ✓ config: extensions.objectformat = sha256
  ✓ HEAD: 11a78c5d7ddc121d551d9bbf5e8fe7de2b329da77334cde2f2131ec44b1ef83d
  ✓ refs: 283 / 283 resolve to objects
  ✓ git fsck --full: clean
sha256 bare repo: /var/folders/.../tmp.S5PP3dNay8
source: https://github.com/entireio/cli.git (v2)
converted: 22989 blobs, 25350 trees, 6950 commits, 0 tags
refs written: 283
warning: stripped 4042 GPG signature(s); they no longer match the rewritten object content
rewrote 48 SHA1 hash reference(s) in commit/tag messages
origin notes ref: refs/notes/sha1-origin (use `git notes --ref=sha1-origin show <sha256>` to recover old SHA1)
signed 213 branch attestation tag(s): refs/tags/converted/1072-..., refs/tags/converted/20260116-..., refs/tags/converted/20260119-..., refs/tags/converted/20260128-..., refs/tags/converted/20260206-..., ... (208 more; full list in --json)

Inspecting one of the signed tags:

$ git -C "$OUT" cat-file -p refs/tags/converted/main
object 11a78c5d7ddc121d551d9bbf5e8fe7de2b329da77334cde2f2131ec44b1ef83d
type commit
tag converted/main
tagger Andrea Nodari <and.nodari@gmail.com> 1779698132 +0200

SHA1 → SHA256 conversion attestation for refs/heads/main.

Source: https://github.com/entireio/cli.git
Produced by git-sync convert-sha256.
-----BEGIN SSH SIGNATURE-----
U1NIU0lHAAAAAQAAADMAAAALc3NoLWVkMjU1MTkAAAAg7xes3P5ZXspRJNdboZZaUN+GZK
PowQqx6UX8bp0evB4AAAADZ2l0AAAAAAAAAAZzaGE1MTIAAABTAAAAC3NzaC1lZDI1NTE5
AAAAQLn2hdEdYiTF5MFug8L+mVD6fU/ZDsZF9nm6a6MHqIrGchMHxnkeN6kLPG4UdAKyxD
uk/lW8VIW/iErg9K46vA8=
-----END SSH SIGNATURE-----

git verify-tag refs/tags/converted/main (against the user's allowed-signers file) then verifies the entire history.

Other flags

  • --all-refs / --exclude-ref-prefix — extend scope to notes/pulls/etc., or subtract
  • --write-mapping <path> — emit SHA1 → SHA256 TSV for rewriting external systems (PR descriptions, deploy manifests, etc.)
  • --no-rewrite-messages — skip inline rewriting of hash references in commit/tag messages
  • --no-origin-notes — skip refs/notes/sha1-origin (per-commit reverse lookup of original SHA1)
  • --check — post-conversion sanity steps (config, HEAD resolves, refs resolve, git fsck --full)
  • --progress — live per-phase object counts (TTY only)
  • --sign-key <id> — override user.signingkey for --sign

Full reference in docs/convert-sha256.md.

Test plan

  • go test ./... passes
  • go test -race ./cmd/git-sync/internal/sha256convert/ clean (atomic counters used so the --progress ticker doesn't race against translation)
  • Unit tests cover the translator end-to-end: blobs/trees/commits/tags including signed commit + signed annotated tag, idempotency, on-disk sha256(content) invariant for every loose object, message rewriting (ancestor + cross-branch + ambiguous + skip), origin notes, mapping file, submodule fail-fast in discovery
  • Integration test against local git-http-backend (gated on GITSYNC_E2E_SHA256_HTTP_BACKEND=1): exercises pack fetch + discovery + translation + rewriting + notes + mapping + --check end-to-end, asserts git fsck clean
  • Signing integration test: spins up an ephemeral ed25519 SSH key via ssh-keygen, points GIT_CONFIG_GLOBAL at a generated gitconfig with SSH signing, runs convert-sha256 --sign, asserts the resulting refs/tags/converted/main is a signed annotated tag targeting the converted branch tip
  • Smoke-tested manually against spf13/cobra (37 refs, ~4,500 objects) and entireio/cli (283 refs, ~55,000 objects); git fsck --full clean; git verify-tag refs/tags/converted/main passes against an SSH allowed-signers file

🤖 Generated with Claude Code


Note

High Risk
Large new conversion path touches object encoding, ref policy, and optional signing; mistakes could produce corrupt or misleading repos, though submodule/side-output guards and --check mitigate operational risk.

Overview
Adds git-sync convert-sha256, a one-off migration that pulls a SHA1 repo over smart HTTP into a new empty SHA256 bare directory, re-encoding all reachable objects and refs (branches/tags always; optional --all-refs / --exclude-ref-prefix with guards so heads/tags cannot be dropped).

The implementation lives in cmd/git-sync/internal/sha256convert: temp SHA1 fetch → reachability discovery (submodule gitlinks fail before the target is created) → manual SHA256 loose-object writes (workaround for go-git’s SHA1 object writer) → ref/HEAD setup, optional message hash rewrites, refs/notes/sha1-origin, TSV mapping, --sign attestation tags, and --check. The Cobra layer prints partial results on failure and fixes positional vs flag arg resolution.

Docs/README document usage and sharp edges (signatures stripped, submodules). Tests cover the translator, CLI arg resolution, and gated git-http-backend e2e; golangci allows gitproto.Conn / AuthMethod return types.

Reviewed by Cursor Bugbot for commit 26f70ca. Configure here.

nodo added 6 commits May 23, 2026 01:40
Entire-Checkpoint: 198bc99d4b0b
Entire-Checkpoint: dc4592acaadf
Entire-Checkpoint: fb6dc8f73ee0
Entire-Checkpoint: 80a4babf00ee
This allows to "sign" the conversion.

Entire-Checkpoint: 1857d92ce8c7
golangci-lint v2.11.4 was unhappy across errcheck, exhaustive,
gocritic, inamedparam, ireturn, maintidx, noctx, perfsprint, revive,
and wrapcheck. Mostly mechanical:

- Test helpers (initSHA1, initSHA256, mustTranslator) collapse the
  '_ :=' patterns into t.Fatalf-on-error wrappers.
- exec.Command -> exec.CommandContext(ctx, ...); ctx threaded into
  runChecks and signBranchTips.
- refs.ForEach return value checked instead of discarded.
- if/else chain in runChecks rewritten as switch.
- ireturn for openSource/normalizeAuth annotated, since both return
  shared transport interfaces by design.
- maintidx on Run annotated; the function is a phase orchestrator,
  splitting it would obscure the pipeline.
- exhaustive switches on plumbing.ObjectType annotated; the
  unhandled cases (OFSDelta/REFDelta/AnyObject/InvalidObject) can't
  reach a resolved storer.
- Errors from io.ReadAll/MemoryObject.Reader/bufio.Flush/fmt.Fprintln/
  auth.Method.Authorizer wrapped with fmt.Errorf.
- Constant 'max' renamed to package-level 'previewMax' to stop
  shadowing the builtin in two places.
- Named parameter added to interface methods that lint flagged.
- Two fmt.Sprintf calls replaced with string concatenation.

Entire-Checkpoint: 43bf90985cc5
@nodo
Copy link
Copy Markdown
Collaborator Author

nodo commented May 25, 2026

bugbot run

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 2b36d81. Configure here.

nodo and others added 10 commits May 26, 2026 10:03
The previous substring check would pass on a commented
`# objectformat = sha256` line or an `oldobjectformat = sha256`
key in another section. Use go-git's config decoder to look up
`extensions.objectformat` properly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 122fd96da7bf
Substring-matching "error" against the full fsck output could
false-positive on benign dangling/warning lines that happened
to contain that word (e.g. a branch or path with "error" in
it). Scan line-by-line for "error:" / "fatal:" / "missing "
/ "broken link" / "bad " prefixes, which is what git itself
uses to signal real problems.

Entire-Checkpoint: f16d85f433d8
translate() is the single recursive entry point invoked from
the tree-entry, commit-parent, tag-target, and message-ref
loops. Check ctx.Err() at the top so Ctrl-C during a
kernel-scale conversion returns promptly instead of running
the whole DFS to completion. ctx is plumbed through
newTranslator and stored on the translator (signatures of the
many translateX helpers stay untouched).

Entire-Checkpoint: 354e2e640545
Some HTTP v1 servers don't return HEAD in their info/refs.
Previously HEAD was left at PlainInit's default refs/heads/master,
which then fails the --check HEAD step on any main-default repo.
pickHEAD now selects in order: server-advertised HEAD (resolved via
the matched DesiredRef.TargetRef so user-supplied ref mappings are
honored), refs/heads/main, refs/heads/master, then the
lexicographically first branch. Tags-only conversions still leave
HEAD at the PlainInit default since there's no sensible branch to
point at.

Entire-Checkpoint: 161329c78d89
- signBranchTips: explain the os.Stderr departure from req.Out;
  TTY inheritance is required for gpg/ssh-agent prompts.
- Don't double-count when a commit/tag has both Signature and
  SignatureSHA256 (two encodings of the same signature). Relabel
  the warning to "signature(s) / mergetag header(s)" so the count
  matches what it actually represents.
- writeOriginNotes now honors SOURCE_DATE_EPOCH and otherwise
  pins the wrapper-commit timestamp to the Unix epoch, so the
  notes-ref hash is reproducible across runs over identical
  source state. The timestamp is bookkeeping; it carries no
  information about the underlying SHA1 history.
- writeLoose uses zlib level 1, matching git's
  core.looseCompression default — loose objects are short-lived
  before gc packs them, so write speed > size.
- Clarify encodeBody's "scratch MemoryObject" comment so the
  unused format argument is no longer mystifying on re-read.

Entire-Checkpoint: cabf3e19b5f2
The convert-sha256 docs promise that every branch and tag is
always converted, since dropping any of them risks stranding
cross-branch SHA1 references in commit and tag messages — the
exact invariant the message-rewrite pass exists to maintain.
The previous code piped req.ExcludeRefPrefixes straight into
planner.BuildDesiredRefs, which applies it to branch and tag
selection too, so the flag silently broke the promise.

Validate at the top of Run: refuse any prefix that, under the
planner's HasPrefix matching, would catch a refs/heads/* or
refs/tags/* name. The check covers bare "", "refs/", partial
"refs/h", whole "refs/heads/" / "refs/tags/", and any sub-
namespace under either. Help text on convert-sha256 now also
calls the rejection out.

Entire-Checkpoint: ec9a87c6c8f1
writeRefs lands the source ref set on the target, then
writeOriginNotes and signBranchTips publish their own refs on
top. A source repo that already advertised refs/notes/sha1-origin
(under --all-refs) or any refs/tags/converted/* (always, since
tags are mandatory) would have those refs silently clobbered.

Add checkSideOutputCollision: run after planner.BuildDesiredRefs
and before any object work, refusing with an actionable message
that names the offending ref(s) and points at --no-origin-notes
/ --exclude-ref-prefix / dropping --sign as escapes.

Also tighten --check to skip *only* the side-output refs we
actually wrote: pass the {origin notes, signed tags} set into
runChecks instead of pattern-matching by prefix, so a legitimate
source ref that happened to share a namespace is not silently
hidden from the resolved/expected fraction.

Entire-Checkpoint: 250149a84f90
CI-blocking gofmt diff on the refs/notes/sha1-origin row.

Entire-Checkpoint: 0ae6b701b541
The previous code assigned args[0] to SourceURL and args[1] to
TargetDir unconditionally, so `--source-url <url> <dir>` left
TargetDir empty (len(args) == 1, so args[1] was never read) and
failed the "requires a source URL and a target directory" check.

Consume positionals left-to-right against fields the user has
not yet supplied via flags. The four flag/positional shapes
(zero/one/two flags) now all yield the correct assignment.
Extracted into resolveConvertSHA256Args for a unit test that
covers every shape including the regression case.

Entire-Checkpoint: 60184126410a
These are project-wide transport/auth interfaces, the same
shape as the auth.Method entry that's already on the ireturn
allowlist. Without the entries, mise run lint (which runs
golangci-lint --fix) was racing: nolintlint stripped the
//nolint:ireturn directives on openSource / normalizeAuth /
newConn before ireturn flagged the functions, leaving lint red
on every run. With the types allowed by policy, the directives
become dead weight — drop them so the source documents intent
once, in .golangci.yaml.

Entire-Checkpoint: 9fa947bf886d
@nodo
Copy link
Copy Markdown
Collaborator Author

nodo commented May 26, 2026

bugbot run

Comment thread cmd/git-sync/internal/sha256convert/sha256convert.go
nodo added 3 commits May 26, 2026 11:37
1. Submodule gitlinks: drop the "vendored" carve-out. Even when
   the linked-to commit lives in the source store, rewriting the
   gitlink to SHA256 produces a tree that fsck-passes but breaks
   `git submodule update` forever — the .gitmodules upstream
   still advertises only SHA1. Refuse any submodule gitlink in
   discoverReachable; keep a defensive guard in translateTree.

2. --check HEAD on tag-only conversions: pickHEAD returns ""
   when no branches landed, so HEAD stays at the PlainInit
   default refs/heads/master and the HEAD check guarantees a
   failure after an otherwise successful run. runChecks now
   takes a hasBranches bool and marks HEAD as skipped when
   false, with a "tags-only conversion" detail.

3. Partial signed-tags list dropped on error: signBranchTips
   returns the tags it created before failing, but Run was
   assigning res.SignedTags = signed only on the success path
   and the cobra wrapper dropped result entirely on err. Assign
   res.SignedTags before the err check, and have the cobra
   wrapper print the partial result on error so users see which
   converted/* tags landed and need cleanup.

4. --keep-source-objects on error paths: the flag's whole
   purpose is debugging failed conversions, but cleanupTemp was
   only flipped at the *end* of Run, so every error before that
   wiped the temp store. Hoist the cleanupTemp = false / res.TempDir
   assignment to right after MkdirTemp, and propagate `res`
   through every subsequent error return so the kept path
   surfaces in both Result and Lines() output.

Entire-Checkpoint: 451aea1c737a
5. discoverReachable now takes ctx and checks ctx.Err() at the
   top of every visit. Without it, Ctrl-C during the multi-minute
   discovery phase on kernel-scale repos was ignored — exactly
   the case the per-translate() check was added for.

6. writeLoose: document the durability tradeoff. No fsync, by
   design — convert-sha256 is a one-shot bulk operation, not
   incremental, processes millions of objects, and Run wipes the
   target on error so the only supported recovery is rerunning
   from clean state.

7. Target directory cleanup on error. ensureEmptyTarget refuses
   to write into a non-empty dir; without this fix, any failure
   after PlainInit left config/HEAD/refs behind and blocked a
   retry with no recovery hint. New deferred cleanup arms after
   PlainInit, disarms on success, and is suppressed by
   --keep-source-objects so users can inspect partial state.
   A --check failure also disarms cleanup since the conversion
   itself finished and the partial target is what the user needs
   to inspect.

8. hashPattern is now case-insensitive ((?i) prepended), and
   resolveMessageRef lowercases the input before lookup. An
   uppercase/mixed-case SHA1 reference in a commit or tag
   message is now rewritten the same as a lowercase one.

9. Check gained a Skipped bool. Skipped implies OK so callers
   gating on OK still treat it as non-fatal; callers needing a
   stricter audit signal can branch on Skipped first. Applied
   to the fsck-when-git-missing and HEAD-on-tags-only paths,
   with a "○" glyph in the progress output to distinguish from
   real passes.

Tests cover discovery cancellation, fsck skipped, uppercase
hash rewrite, and the existing tag-only HEAD check rewritten
against the Skipped field.

Entire-Checkpoint: 248018a9dd16
…ose, memoization

10. fsckHasError used bufio.Scanner with the default 64 KiB buffer,
    so a single long line (paths in repository-scale fsck output)
    would silently truncate and we'd miss whatever followed it.
    Switch to bytes.Split on raw newlines.

11. discoverReachable was recursive, which on long linear chains
    (kernel-scale: ~70k commits on the deepest single-parent path)
    grew the goroutine stack into the tens of MiB. Iterativize
    with an explicit work stack — memory now scales with the
    in-flight frontier, not the longest chain. translate() stays
    recursive: its edges are dynamic (tree entries + commit parents
    + tag targets + message-reference edges resolved against the
    in-progress mapping), and rewriting it would risk silent
    corruption of message rewrites. Documented the rationale and
    the depth math.

12. translator.dst was set in newTranslator but never read — a
    landmine inviting a future contributor to call go-git's
    SHA1-hardcoded SetEncodedObject. Field removed; the type
    assertion stays (with `_`) so a memory-backed dst is still
    refused with a clear error.

13. fsckHasError's error-line matcher was prefix-only and case-
    sensitive ("error:" / "fatal:"). Broaden to any line whose
    first token starts with "error" or "fatal" (case-insensitive),
    keep the "missing "/"broken link"/"bad " object reports for
    older git. Closer to the previous substring match's coverage
    without the path-substring false positive.

14. writeMappingFile dropped Close errors via a deferred Close.
    On NFS or quota-bound filesystems write failures surface at
    close time, not flush time, so the caller would think the
    mapping landed when it hadn't. Explicit Close on the success
    path, with the deferred Close kept as a best-effort net for
    the failure path.

15. resolveMessageRef ran a full O(len(reachable)) scan for every
    short hash prefix, hit twice per token (once by
    extractMessageReferences, once by rewriteHashesInMessage).
    reachable is frozen before translation starts, so the
    (prefix → matchResult) mapping is stable; memoize it on the
    translator. Halves the documented quadratic ceiling on
    message-token resolution.

Tests cover the long-line + case-insensitive fsck path and the
resolveMessageRef cache (second call returns the cached answer
even after reachable is mutated).

Entire-Checkpoint: bc9f9981f089
@nodo
Copy link
Copy Markdown
Collaborator Author

nodo commented May 26, 2026

bugbot run

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 26f70ca. Configure here.

Comment thread docs/convert-sha256.md
reference cycles are cryptographically infeasible), but a trip into
the guard becomes a hard error rather than a stack overflow.

Loose object writing is done by hand rather than via go-git's
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer the case on go-git/v6@v6.0.0-alpha.4:

go-git/go-git@5cab3a7

Comment thread docs/convert-sha256.md
that walks Jira tickets, PR bodies, deploy manifests, or any other
system that holds frozen SHA1 references.

### Branch-tip attestation tags (opt in via `--sign`)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could switch --sign with --sign-mode which defaults to none, but then has tips (current implementation) and later could have an all, which could sign all objects, to support workflows whereby users want to enforce all commits/tags being signed.


const (
originNotesRef = "refs/notes/sha1-origin"
attestationTagPrefix = "refs/tags/converted/"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the premise here. Are there any concerns on signing tags in place?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants